Executive Summary
CytoAtlas is a comprehensive computational resource that maps cytokine and secreted protein signaling activity across ~29 million human cells and ~31,000 bulk RNA-seq samples from six independent datasets: two bulk RNA-seq resources (GTEx, TCGA) and four single-cell compendia (CIMA, Inflammation Atlas, scAtlas, parse_10M) spanning healthy donors, inflammatory diseases, cancers, and cytokine perturbations. The system uses linear ridge regression against experimentally derived signature matrices to infer activity — producing fully interpretable, conditional z-scores rather than black-box predictions.
Key results:
- 1,213 signatures (43 CytoSig + 1,170 SecAct), plus 178 cell-type-specific LinCytoSig variants, validated across 6 independent datasets
- Spearman correlations reach ρ=0.6–0.9 for well-characterized cytokines (IL1B, TNFA, VEGFA, TGFB family)
- Cross-dataset consistency demonstrates signatures generalize across CIMA, Inflammation Atlas Main, scAtlas, GTEx, and TCGA
- SecAct achieves the highest median correlations in 5 of 6 datasets (independence-corrected median ρ=0.31–0.46)
1. System Architecture and Design Rationale
1.1 Architecture and Processing
Linear interpretability over complex models. Ridge regression (L2-regularized linear regression) was chosen deliberately over methods like autoencoders, graph neural networks, or foundation models. The resulting activity z-scores are conditional on the specific genes in the signature matrix, meaning every prediction can be traced to a weighted combination of known gene responses. This is critical for biological interpretation — a scientist can ask “which genes drive the IFNG activity score in this sample?” and get a direct answer.
Reproducibility through separation of concerns. The system is divided into independent components, each chosen for the constraints of HPC/SLURM infrastructure:
| Component | Technology | Purpose | Rationale |
|---|---|---|---|
| Pipeline | Python + CuPy (GPU) | Activity inference | 10–34x speedup over NumPy; batch-streams H5AD files (500K–1M cells/batch) with projection matrix held on GPU; automatic CPU fallback when GPU unavailable |
| Storage | DuckDB (3 databases, 68 tables) | Columnar analytics | Single-file databases require no server — essential on HPC where database servers are unavailable; each database regenerates independently without affecting others |
| API | FastAPI (262 endpoints) | RESTful data access | Async I/O for concurrent DuckDB queries; automatic OpenAPI documentation; Pydantic request validation; lifespan management for resource initialization |
| Frontend | React 19 + TypeScript | Interactive exploration (12 pages) | Migrated from 25K-line vanilla JS SPA to 11.4K lines (54% reduction) with type safety, component reuse, and lazy-loaded routing |
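The batch-streaming design in the Pipeline row works because the ridge solution is linear in the expression data: a projection matrix can be computed once and then applied to each batch of cells independently. A minimal NumPy sketch of that idea follows. The function names `ridge_projection` and `stream_activity` are hypothetical, the toy sizes and `lam` value are illustrative, and this is not the pipeline's actual code. Because CuPy mirrors much of the NumPy array API, the same arithmetic could in principle run on GPU arrays.

```python
import numpy as np

def ridge_projection(signature, lam):
    """Return P = (S^T S + lam*I)^-1 S^T for a genes x targets signature matrix."""
    n_targets = signature.shape[1]
    gram = signature.T @ signature + lam * np.eye(n_targets)
    return np.linalg.solve(gram, signature.T)  # shape: targets x genes

def stream_activity(projection, expression, batch_size):
    """Apply the projection to a genes x cells matrix one column batch at a time."""
    out = []
    for start in range(0, expression.shape[1], batch_size):
        batch = expression[:, start:start + batch_size]
        out.append(projection @ batch)  # only this batch is needed at once
    return np.concatenate(out, axis=1)  # shape: targets x cells

# Toy data (illustrative sizes, not the real matrices).
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 5))      # genes x targets
Y = rng.normal(size=(200, 1000))   # genes x cells
P = ridge_projection(S, lam=10.0)
batched = stream_activity(P, Y, batch_size=256)
full = P @ Y                       # same result computed in one shot
```

Streaming gives the same coefficients as the one-shot product, so batch size is only a memory and throughput choice.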
Processing scale. Ridge regression (λ = 5×10⁵) is applied using secactpy.ridge() against each signature matrix. For single-cell data, expression is first aggregated to pseudobulk (donor or donor×celltype level), then genes are intersected with the signature matrix (CytoSig: ~4,860 genes; SecAct: ~7,450 genes). The resulting z-scored activity coefficients are compared to target gene expression via Spearman correlation across donors.
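As a rough illustration of how ridge coefficients can be turned into permutation-based z-scores, the sketch below fits the closed-form ridge solution and compares each coefficient with a null distribution obtained by shuffling gene labels. This is an assumption-laden sketch: the internals of secactpy.ridge() are not described in this report, `ridge_zscores` is a hypothetical name, and the toy `lam`, matrix sizes, and permutation count are illustrative only.

```python
import numpy as np

def ridge_zscores(signature, expression, lam, n_perm=200, seed=0):
    """signature: genes x targets; expression: genes x samples. Returns (beta, z)."""
    rng = np.random.default_rng(seed)
    gram = signature.T @ signature + lam * np.eye(signature.shape[1])
    proj = np.linalg.solve(gram, signature.T)   # targets x genes
    beta = proj @ expression                    # targets x samples
    null = np.empty((n_perm,) + beta.shape)
    for i in range(n_perm):
        # Shuffle gene order in the expression matrix to break the gene-signature link.
        shuffled = expression[rng.permutation(expression.shape[0]), :]
        null[i] = proj @ shuffled
    z = (beta - null.mean(axis=0)) / null.std(axis=0)
    return beta, z

# Toy data: target 0 is active in sample 0, target 2 in sample 1.
rng = np.random.default_rng(1)
S = rng.normal(size=(300, 3))
true_activity = np.array([[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
Y = S @ true_activity + 0.5 * rng.normal(size=(300, 2))
beta, z = ridge_zscores(S, Y, lam=1.0)
```

In this toy setup the active targets receive large positive z-scores while the inactive target stays near zero, which is the behavior the activity scores are meant to capture.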
| Dataset | Cells/Samples | Processing Time | Hardware |
|---|---|---|---|
| GTEx | 19,788 bulk samples | ~10min | A100 80GB |
| TCGA | 11,069 bulk samples | ~10min | A100 80GB |
| CIMA | 6.5M cells | ~2h | A100 80GB |
| Inflammation Atlas (main/val/ext) | 6.3M cells | ~2h | A100 80GB |
| scAtlas Normal | 2.3M cells | ~1h | A100 80GB |
| scAtlas Cancer | 4.1M cells | ~1h | A100 80GB |
| parse_10M | 9.7M cells | ~3h | A100 80GB |
Total: ~29M single cells + ~31K bulk RNA-seq samples, processed through ridge regression against 3 signature matrices (CytoSig: 43 cytokines, LinCytoSig: 178 cell-type-specific, SecAct: 1,170 secreted proteins). Processing Time = wall-clock time for full activity inference on a single NVIDIA A100 GPU (80 GB VRAM). For bulk datasets (GTEx/TCGA), ridge regression is applied with within-tissue/within-cancer mean centering to remove tissue-level variation. See Section 2.1 for per-dataset details.
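The within-tissue or within-cancer mean centering mentioned above can be sketched as follows. The function name `center_within_groups` and the toy values are hypothetical; the sketch assumes a samples × genes matrix and one group label per sample.

```python
import numpy as np

def center_within_groups(values, groups):
    """Subtract each group's per-gene mean. values: samples x genes."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    centered = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        centered[mask] = values[mask] - values[mask].mean(axis=0)
    return centered

# Toy example: two tissues with very different baseline expression.
expr = np.array([[10.0, 1.0], [12.0, 3.0], [100.0, 50.0], [104.0, 54.0]])
tissue = ["Lung", "Lung", "Liver", "Liver"]
centered = center_within_groups(expr, tissue)
```

After centering, the large baseline difference between the two tissues is gone and only within-tissue variation remains.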
1.2 Validation Strategy
CytoAtlas validates at four aggregation levels, each testing whether predicted activity correlates with target gene expression (Spearman ρ) across independent samples:
| Level | Description | Datasets | Report Section |
|---|---|---|---|
| Donor pseudobulk | One value per donor, averaging across cell types | CIMA, Inflammation Atlas Main, scAtlas Normal/Cancer | §4.1, §4.3 |
| Donor × cell-type | Stratified by cell type within each donor | CIMA, Inflammation Atlas Main, scAtlas Normal/Cancer | §4.7 |
| Per-tissue / per-cancer | Median-of-medians across tissues or cancer types | GTEx (29 tissues), TCGA (33 cancer types) | §4.2, §4.3 |
| Cross-platform | Bulk vs pseudobulk concordance per tissue/cancer | GTEx vs scAtlas Normal, TCGA vs scAtlas Cancer | §4.4 |
All statistics use independence-corrected values — preventing inflation from repeated measures across tissues, cancer types, or cell types. CytoSig vs SecAct comparisons use Mann-Whitney U (total) and Wilcoxon signed-rank (32 matched targets) with BH-FDR correction. See Section 3.3 for the validation philosophy and Section 4 for full results.
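For readers unfamiliar with BH-FDR, a minimal pure-Python sketch of the Benjamini-Hochberg adjustment is shown below. The p-values are made up for illustration and are not results from this report; `bh_fdr` is a hypothetical helper name.

```python
def bh_fdr(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues

toy_p = [0.01, 0.04, 0.03, 0.20]   # illustrative values only
toy_q = bh_fdr(toy_p)
```

A stratum would then be called significant when its q-value falls below the chosen threshold (q < 0.05 in this report).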
Why independence correction matters: Pooling across tissues or cancer types inflates correlations through confounding. For example, GTEx pooled CytoSig median ρ (0.211) is 40% higher than the independence-corrected by-tissue value (0.151); SecAct shows +30% inflation (0.394 vs 0.304). All results in this report use the corrected values. For a detailed comparison of pooled vs independent levels, including inflation magnitude and finer cell-type stratification, see the Section 4.1 statistical supplement.
2. Dataset Catalog
2.1 Datasets and Scale
| # | Dataset | Type | Cells/Samples | Donors | Cell Types | Reference |
|---|---|---|---|---|---|---|
| 1 | GTEx | Bulk RNA-seq | 19,788 samples | 946 donors | — | GTEx Consortium, v11 |
| 2 | TCGA | Bulk RNA-seq | 11,069 samples | 10,274 donors | — | TCGA PanCancer |
| 3 | CIMA | scRNA-seq | 6,484,974 | 421 donors | 27 L2 / 100+ L3 | J. Yin et al., Science, 2026 |
| 4 | Inflammation Atlas Main | scRNA-seq | 4,918,140 | 817 samples* | 66+ | Jimenez-Gracia et al., Nature Medicine, 2026 |
| 5 | Inflammation Atlas Val | scRNA-seq | 849,922 | 144 samples* | 66+ | Validation cohort |
| 6 | Inflammation Atlas Ext | scRNA-seq | 572,872 | 86 samples* | 66+ | External cohort |
| 7 | scAtlas Normal | scRNA-seq | 2,293,951 | 317 donors | 102 subCluster | Q. Shi et al., Nature, 2025 |
| 8 | scAtlas Cancer | scRNA-seq | 4,146,975 | 717 donors (601 tumor-only) | 162 cellType1 | Q. Shi et al., Nature, 2025 |
| 9 | parse_10M | scRNA-seq | 9,697,974 | 12 donors × 90 cytokines (+PBS control) | 18 PBMC types | Oesinghaus et al., bioRxiv, 2026 |
Grand total: ~29 million single cells + ~31K bulk samples from 6 independent studies (9 datasets), 100+ cell types.
* Inflammation Atlas does not provide donor-level identifiers; the 817/144/86 values are sample counts. The donor–sample relationship is unknown, so correlations use sampleID as the independent unit.
2.2 Disease and Condition Categories
CIMA (421 healthy donors): Healthy population atlas with paired blood biochemistry (19 markers: ALT, AST, glucose, lipid panel, etc.) and plasma metabolomics (1,549 features). Enables age, BMI, sex, and smoking correlations with cytokine activity.
Inflammation Atlas (20 diseases), including: RA, SLE, Sjögren's, PSA, Crohn's, UC, COVID-19, Sepsis, HIV, HBV, BRCA, CRC, HNSCC, NPC, COPD, Cirrhosis, MS, Asthma, Atopic Dermatitis
scAtlas Normal (317 donors): 35 organs, 12 tissues with ≥20 donors for per-organ stratification (Breast 124, Lung 97, Colon 65, Heart 52, Liver 43, etc.)
scAtlas Cancer (717 donors, 601 tumor-only): 29 cancer types, 11 with ≥20 tumor-only donors for per-cancer stratification (HCC 88, PAAD 58, CRC 51, ESCA 48, HNSC 39, LUAD 36, NPC 36, KIRC 31, BRCA 30, ICC 29, STAD 27)
parse_10M: 90 cytokines × 12 donors, an independent in vitro perturbation dataset for comparison. Most of the cytokines (~58%) are produced in E. coli, with the remainder from insect (Sf21, 12%) and mammalian (CHO, NS0, HEK293, ~30%) expression systems. Because exogenous perturbagens may induce effects that differ from those of endogenously produced cytokines, parse_10M serves as an independent comparison rather than strict ground truth. CytoSig and SecAct have a potential advantage in this regard, as they infer relationships directly from physiologically relevant samples.
2.3 Signature Matrices
| Matrix | Targets | Construction | Reference |
|---|---|---|---|
| CytoSig | 43 cytokines | Median log2FC across all experimental bulk RNA-seq | Jiang et al., Nature Methods, 2021 |
| LinCytoSig | 178 (45 cell types × 1–13 cytokines) | Cell-type-stratified median from CytoSig database (methodology) | This work |
| SecAct | 1,170 secreted proteins | Median global Moran's I across 1,000 Visium datasets | Ru et al., Nature Methods, 2026 (in press) |
3. Scientific Value Proposition
3.1 What Makes CytoAtlas Different from Deep Learning Approaches?
Most single-cell analysis tools use complex models (VAEs, GNNs, transformers) that produce aggregated, non-linear representations difficult to interpret biologically. CytoAtlas takes the opposite approach:
| Property | CytoAtlas (Ridge Regression) | Typical DL Approach |
|---|---|---|
| Model | Linear (z = Xβ + ε) | Non-linear (multi-layer NN) |
| Interpretability | Every gene's contribution is a coefficient | Feature importance approximated post-hoc |
| Conditionality | Activity conditional on specific gene set | Latent space mixes all features |
| Confidence | Permutation-based z-scores with CI | Often point estimates only |
| Generalization | Tested across 6 independent cohorts | Often held-out splits of same cohort |
| Bias | Transparent — limited by signature matrix genes | Hidden in architecture and training data |
The key insight: CytoAtlas is not trying to replace DL-based tools. It provides an orthogonal, complementary signal that a human scientist can directly inspect. When CytoAtlas says "IFNG activity is elevated in CD8+ T cells from RA patients," you can verify this by checking the IFNG signature genes in those cells.
3.2 What Scientific Questions Does CytoAtlas Answer?
- Which cytokines are active in which cell types across diseases? — IL1B/TNFA in monocytes/macrophages, IFNG in CD8+ T and NK cells, IL17A in Th17, VEGFA in endothelial/tumor cells, TGFB family in stromal cells — quantified across 20 diseases, 35 organs, and 15 cancer types.
- Are cytokine activities consistent across independent cohorts? — Yes. IL1B, TNFA, VEGFA, and TGFB family show consistent positive correlations across all 6 validation datasets (Figure 8).
- Does cell-type-specific biology matter for cytokine inference? — For select immune types, yes: LinCytoSig improves prediction for Basophils (+0.21 Δρ), NK cells (+0.19), and DCs (+0.18), but global CytoSig wins overall (Figures 11–12).
- Which secreted proteins beyond cytokines show validated activity? SecAct (1,170 targets) achieves the highest median correlations in 5 of 6 datasets (median ρ=0.33–0.49; Inflammation Atlas Main is the exception, see Section 4.2), with novel validated targets like Activin A (ρ=0.98), CXCL12 (ρ=0.92), and BMP family (Figure 13).
- Can we predict treatment response from cytokine activity? — We are incorporating cytokine-blocking therapy outcomes from bulk RNA-seq to test whether predicted cytokine activity associates with therapy response. Additionally, Inflammation Atlas responder/non-responder labels enable treatment response prediction using cytokine activity profiles as features.
3.3 Validation Philosophy
CytoAtlas validates against a simple but powerful principle: if CytoSig predicts high IFNG activity for a sample, that sample should have high IFNG gene expression. This expression-activity correlation is computed via Spearman rank correlation across donors/samples.
This is a conservative validation — it only captures signatures where the target gene itself is expressed. Signatures that act through downstream effectors would not be captured, meaning our validation underestimates true accuracy.
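The expression-activity check reduces to a Spearman rank correlation between two vectors indexed by donor. A self-contained sketch follows; the helper names and the five toy donors are hypothetical and the values are not taken from any atlas.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy donors: predicted activity z-scores vs target gene expression.
activity = [-1.2, 0.3, 0.8, 2.1, 1.5]
expression = [0.1, 0.9, 0.7, 3.0, 2.2]
rho = spearman_rho(activity, expression)
```

Here four of the five donors keep the same rank order in both vectors, so ρ is high but below 1.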
4. Validation Results
4.1 Overall Performance Summary
PRIMARY independent level: The summary table above reports results at each dataset’s PRIMARY independent level — the aggregation level where samples are fully independent (each donor counted once). This ensures correlation statistics are not inflated by donor duplication. See the “Primary Level” column for each dataset’s level.
How “N Targets” is determined: A target is included in the validation for a given atlas only if (1) the target’s signature genes overlap sufficiently with the atlas gene expression matrix, and (2) the target gene itself is expressed in enough samples to compute a meaningful Spearman correlation. Targets whose gene is absent or not detected in a dataset are excluded. CytoSig defines 43 cytokines and SecAct defines 1,170 secreted proteins. Inflammation Atlas Main retains only 33 of the 43 CytoSig targets because 10 cytokine genes (BDNF, BMP4, CXCL12, GCSF, IFN1, IL13, IL17A, IL36, IL4, WNT3A) are not sufficiently expressed in these blood/PBMC samples; the same expression filter leaves 805 of the 1,170 SecAct targets.
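The two inclusion criteria can be expressed as a simple filter. The sketch below is illustrative only: the report does not state the actual cutoffs, so `min_overlap` and `min_expressed` are hypothetical thresholds, and the gene names `A`–`D` and `W`–`Z` are placeholders.

```python
def eligible_targets(signature_genes, atlas_genes, target_expression,
                     min_overlap=0.5, min_expressed=10):
    """Return targets passing both criteria (thresholds are hypothetical)."""
    atlas = set(atlas_genes)
    kept = []
    for target, genes in signature_genes.items():
        # Criterion 1: fraction of signature genes present in the atlas.
        overlap = len(atlas & set(genes)) / len(genes)
        # Criterion 2: number of samples where the target gene is detected.
        values = target_expression.get(target, [])
        n_expressed = sum(1 for v in values if v > 0)
        if overlap >= min_overlap and n_expressed >= min_expressed:
            kept.append(target)
    return sorted(kept)

signature_genes = {
    "IL1B": ["A", "B", "C", "D"],
    "IL17A": ["A", "B", "C", "D"],
    "WNT3A": ["W", "X", "Y", "Z"],
}
atlas_genes = ["A", "B", "C", "IL1B", "IL17A"]
target_expression = {
    "IL1B": [1.0] * 12,            # detected in every sample
    "IL17A": [0.0] * 11 + [2.0],   # detected in a single sample
}
kept = eligible_targets(signature_genes, atlas_genes, target_expression)
```

In this toy case IL17A fails the expression criterion and WNT3A fails both, mirroring how blood-derived data can drop tissue-restricted targets.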
Stratified levels (GTEx by_tissue, TCGA primary_by_cancer): Correlations are computed within each tissue/cancer type (ensuring independence), then summarized across groups. N Targets counts unique targets at the “all” aggregate level. Finer per-tissue or per-cancer breakdowns are available in Section 4.3 below.
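A minimal sketch of the median-of-medians summary: take each target's median ρ across strata, then the median of those per-target values. The helper name and the toy ρ values are hypothetical.

```python
from statistics import median

def median_of_medians(rho_by_stratum):
    """rho_by_stratum: {stratum: {target: rho}} -> (per-target medians, overall median)."""
    per_target = {}
    for rhos in rho_by_stratum.values():
        for target, rho in rhos.items():
            per_target.setdefault(target, []).append(rho)
    target_medians = {t: median(v) for t, v in per_target.items()}
    return target_medians, median(target_medians.values())

toy = {
    "Lung": {"IL1B": 0.6, "TNFA": 0.2},
    "Liver": {"IL1B": 0.4, "TNFA": 0.4},
    "Colon": {"IL1B": 0.5, "TNFA": 0.0},
}
target_medians, overall = median_of_medians(toy)
```

Because every correlation is computed inside one stratum before summarizing, between-tissue differences cannot inflate the final value.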
4.2 Cross-Dataset Comparison: CytoSig vs SecAct
Why does SecAct appear to underperform CytoSig in Inflammation Atlas Main?
This is a composition effect, not a genuine performance gap, confirmed by two complementary statistical tests:
Total comparison (Mann–Whitney U test): Compares the full ρ distributions of CytoSig (43 cytokine signatures) vs SecAct (~1,170 secreted protein signatures) using independence-corrected values. For GTEx/TCGA, each target’s representative ρ is the median across per-tissue/cancer values (median-of-medians); for other datasets, donor_only/tumor_only ρ is used directly. SecAct achieves a significantly higher median ρ in 5 of 6 datasets (GTEx: p = 4.76 × 10⁻⁴; TCGA: p = 2.85 × 10⁻³; CIMA: p = 3.18 × 10⁻²; scAtlas Normal: p = 1.04 × 10⁻⁴; scAtlas Cancer: p = 1.06 × 10⁻⁵). Inflammation Atlas Main is the sole exception (U = 14,101, p = 0.548, not significant) and the only dataset where CytoSig’s median ρ (0.323) exceeds SecAct’s (0.173).
Matched comparison (Wilcoxon signed-rank test): Restricts to the 32 targets shared between both methods (22 direct + 10 alias-resolved), each target serving as its own control. SecAct’s median ρ is consistently higher across all 6 datasets, reaching significance in 5 (GTEx: p = 3.54 × 10⁻⁵; TCGA: p = 3.24 × 10⁻⁶; CIMA: p = 2.28 × 10⁻²; scAtlas Normal: p = 3.54 × 10⁻⁵; scAtlas Cancer: p = 3.54 × 10⁻⁵). Inflammation Atlas Main is not significant (p = 0.141).
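To make the matched design concrete, the sketch below computes a Wilcoxon signed-rank statistic on paired per-target ρ values using the large-sample normal approximation, without tie or continuity corrections. It is a simplified stand-in: in practice a library routine such as scipy.stats.wilcoxon would normally be used, and the ten paired values here are invented for illustration.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired test via the normal approximation (no tie/continuity correction)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p

# Toy paired rho values for 10 shared targets (illustrative only).
method_a = [0.51, 0.44, 0.64, 0.37, 0.53, 0.61, 0.36, 0.78, 0.47, 0.54]
method_b = [0.50, 0.42, 0.61, 0.33, 0.48, 0.55, 0.29, 0.70, 0.38, 0.44]
w_plus, z, p = wilcoxon_signed_rank(method_a, method_b)
```

Pairing by target removes between-target variability, which is why the matched test can detect a consistent direction even with only 32 targets.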
Inflammation Atlas Main is largely blood-derived, so many SecAct targets that perform well in multi-organ contexts contribute near-zero or negative correlations here. In fact, 99 SecAct targets are negative only in Inflammation Atlas Main but positive in all other datasets, reflecting tissue-specific expression limitations rather than inference failure. The “Matched” tab above demonstrates the fair comparison on equal footing.
4.3 Per-Tissue and Per-Cancer Stratified Validation
Stratified validation: Instead of aggregating tissues/cancers into a single median-of-medians, this view shows the CytoSig vs SecAct comparison within each individual tissue (GTEx) or cancer type (TCGA). Mann-Whitney U test (Total tab: all targets) and Wilcoxon signed-rank test (Matched tab: 32 shared targets) with BH-FDR correction across all strata within each dataset.
Key insight: On the 32 matched targets, SecAct wins direction in every stratum — 29/29 GTEx tissues and 33/33 TCGA cancer types — with 25/29 and 31/33 reaching significance (q<0.05). This unanimous result across 62 independent strata rules out Simpson’s paradox. On total targets, SecAct wins in 28/29 GTEx tissues (21 significant) and 30/33 TCGA cancers (15 significant); the few CytoSig-favored strata (Brain in GTEx; Kidney Chromophobe, Ovarian, Uveal Melanoma in TCGA) are all non-significant. Since SecAct outperforms CytoSig on the same 32 cytokines, the advantage is not about target breadth but about signature quality. SecAct’s spatial-transcriptomics-derived signatures (Visium) capture tissue-context-dependent cytokine regulation that CytoSig’s case-control cytokine treatment experiments might not capture. The advantage is largest in tissues with complex cellular microenvironments (GTEx: Small Intestine Δ=+0.47, Esophagus +0.41; TCGA: Testicular +0.33, Cervical +0.32) and smallest in homogeneous contexts (GTEx: Breast +0.001, Pituitary +0.06; TCGA: Brain Glioma +0.06, Kidney Chromophobe +0.09).
4.4 Cross-Platform Comparison: Bulk vs Pseudobulk
Cross-platform concordance: This section tests whether expression–activity relationships replicate across measurement technologies. For each tissue (GTEx) or cancer type (TCGA), we compute per-target Spearman ρ from bulk RNA-seq data and compare it to the same target’s ρ from single-cell pseudobulk data (scAtlas). Wilcoxon signed-rank tests (paired by target) with BH-FDR correction assess whether ρ values differ between platforms.
Key finding: Using all targets, SecAct shows significant bulk–pseudobulk differences in most GTEx strata and about half of the TCGA strata (11/13 GTEx tissues, 5/11 TCGA cancers), while CytoSig shows almost none (1/13, 0/11). However, the Matched tabs reveal that this is a statistical power effect, not a signal quality difference: when restricted to the same 32 shared targets, CytoSig and SecAct show essentially no significant platform differences (0/13 and 0/13 for GTEx; 0/11 and 1/11 for TCGA). Matched and unmatched SecAct targets show the same per-target platform shift (mean |Δ| = 0.298 vs 0.302, Mann–Whitney p = 0.82), but SecAct’s ~1,000 paired targets per tissue provide 25× more observations than CytoSig’s ~40, easily detecting the same small systematic shift (Δ ≈ 0.03) that CytoSig lacks the power to detect. Core cytokine targets are platform-robust.
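The power argument can be checked with back-of-envelope arithmetic. The sketch below uses a simple z-test on the mean paired difference as a stand-in for the Wilcoxon test used in the report. The spread `sd = 0.38` is an assumed value, roughly what a mean |Δ| near 0.30 would imply if the differences were normally distributed; it is not a reported statistic.

```python
import math

def paired_shift_z(mean_shift, sd, n):
    """z statistic for a mean paired difference under a normal approximation."""
    return mean_shift / (sd / math.sqrt(n))

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

shift, sd = 0.03, 0.38                      # sd is an assumed, illustrative value
z_small = paired_shift_z(shift, sd, 40)     # roughly CytoSig-sized panel
z_large = paired_shift_z(shift, sd, 1000)   # roughly SecAct-sized panel
p_small, p_large = two_sided_p(z_small), two_sided_p(z_large)
```

With 25× more paired targets the test statistic grows by a factor of √25 = 5, so the same Δ ≈ 0.03 moves from clearly non-significant to nominally significant under these assumptions.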
4.5 Best and Worst Correlated Targets
Consistently well-correlated targets:
- IL1B (ρ = 0.67 CIMA, 0.68 Inflammation Atlas Main, 0.72 scAtlas Cancer) — canonical inflammatory cytokine
- TNFA (ρ = 0.63 CIMA, 0.58 Inflammation Atlas Main, 0.55 GTEx) — master inflammatory regulator
- VEGFA (ρ = 0.79 Inflammation Atlas Main, 0.38 GTEx) — angiogenesis factor
- TGFB1/3 (ρ = 0.05–0.56, dataset-dependent; TGFB2 not in CytoSig panel)
- BMP2/4 (ρ = −0.02–0.61, dataset-dependent)
Dataset-dependent targets (negative in single-cell, positive in bulk):
- CD40L: −0.48 CIMA, −0.55 Inflammation Atlas Main, but +0.57 GTEx, +0.40 TCGA
- TRAIL: −0.45 CIMA, −0.54 Inflammation Atlas Main, but +0.58 GTEx, +0.31 TCGA
- LTA: −0.33 CIMA, but +0.26 TCGA; HGF: −0.25 CIMA, −0.29 Inflammation Atlas Main, but +0.40 GTEx
Platform-dependent pattern: Gene mapping is verified (CD40L→CD40LG, TRAIL→TNFSF10, LTA→LTA, HGF→HGF). The negative single-cell correlations likely reflect pseudobulk aggregation effects—membrane shedding (CD40L), decoy receptor sequestration (TRAIL), heteromeric complex dependence (LTA/LTB), and paracrine topology (HGF: fibroblast→epithelial) disproportionately affect cell-level inference. Bulk RNA-seq, which averages across tissue, captures the net activity signal and yields positive correlations for the same targets.
SecAct achieves consistent positive ρ across all 6 datasets for these targets, while CytoSig performance is platform-dependent (mean ρ across 6 datasets):
| Target | CytoSig Mean ρ | SecAct Mean ρ |
|---|---|---|
| CD40LG | +0.02 | +0.46 |
| TNFSF10 | −0.00 | +0.44 |
| LTA | −0.02 | +0.53 |
| HGF | +0.06 | +0.58 |
SecAct’s spatial co-expression signatures (Moran’s I from Visium data) capture tissue-level gene–protein relationships regardless of membrane shedding, proteolytic activation, or paracrine topology.
4.6 Cross-Atlas Consistency
4.7 Effect of Aggregation Level
Aggregation levels explained: Pseudobulk profiles are aggregated at increasingly fine cell-type resolution. At coarser levels, each pseudobulk profile averages more cells, yielding smoother expression estimates but masking cell-type-specific signals. At finer levels, each profile is more cell-type-specific but based on fewer cells.
| Atlas | Level | Description | N Cell Types |
|---|---|---|---|
| CIMA | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| | Donor × L1 | Broad lineages (B, CD4_T, CD8_T, Myeloid, NK, etc.) | 7 |
| | Donor × L2 | Intermediate (CD4_memory, CD8_naive, DC, Mono, etc.) | 28 |
| | Donor × L3 | Fine-grained (CD4_Tcm, cMono, Switched_Bm, etc.) | 39 |
| | Donor × L4 | Finest marker-annotated (CD4_Th17-like_RORC, cMono_IL1B, etc.) | 73 |
| Inflammation Atlas Main | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| | Donor × L1 | Broad categories (B, DC, Mono, T_CD4/CD8 subsets, etc.) | 18 |
| | Donor × L2 | Fine-grained (Th1, Th2, Tregs, NK_adaptive, etc.) | 65 |
| scAtlas Normal | Donor × Organ | Per-organ pseudobulk (Bladder, Blood, Breast, Lung, etc.) | 25 organs |
| | Donor × Organ × CT1 | Broad cell types within each organ | 191 |
| | Donor × Organ × CT2 | Fine cell types within each organ | 356 |
| scAtlas Cancer | Tumor Only | Whole-sample pseudobulk per tumor donor | 1 (all) |
| | Tumor × Cancer | Per-cancer type pseudobulk (HCC, PAAD, CRC, etc.) | 29 types |
| | Tumor × Cancer × CT1 | Broad cell types within each cancer type | ~120 |
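The aggregation levels in the table can be pictured as grouping cells by donor, or by donor and cell type, and summarizing each group into one profile. The sketch below is a simplified illustration: it assumes mean aggregation (the report does not state whether sums or means are used) and drops groups below a `min_cells` cutoff. The function name and toy data are hypothetical.

```python
import numpy as np
from collections import defaultdict

def pseudobulk(expression, donors, cell_types=None, min_cells=10):
    """Average cells x genes expression per donor (or per donor x cell type)."""
    groups = defaultdict(list)
    for i, donor in enumerate(donors):
        key = (donor, cell_types[i]) if cell_types is not None else (donor,)
        groups[key].append(i)
    # Keep only groups with enough cells for a stable average.
    return {key: expression[idx].mean(axis=0)
            for key, idx in groups.items() if len(idx) >= min_cells}

# Toy data: 25 cells x 3 genes from two donors.
expr = np.arange(75, dtype=float).reshape(25, 3)
donors = ["D1"] * 15 + ["D2"] * 10
cell_types = ["T"] * 12 + ["B"] * 3 + ["T"] * 10
donor_level = pseudobulk(expr, donors)
celltype_level = pseudobulk(expr, donors, cell_types)
```

At the finer level the three B cells from D1 fall below the cutoff and are dropped, which illustrates the trade-off between cell-type specificity and the number of cells behind each profile.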
4.8 Representative Scatter Plots
4.9 Biologically Important Targets Heatmap
How each correlation value is computed: For each (target, atlas) cell, we compute Spearman rank correlation between predicted cytokine activity (ridge regression z-score) and target gene expression across all donor-level pseudobulk samples. Specifically:
- Pseudobulk aggregation: For each atlas, gene expression is aggregated to the donor level (one profile per donor or donor × cell type).
- Activity inference: Ridge regression (secactpy.ridge, λ = 5×10⁵) is applied using the signature matrix (CytoSig: 4,881 genes × 43 cytokines; SecAct: 7,919 genes × 1,170 targets) to predict activity z-scores for each pseudobulk sample.
- Correlation: Spearman ρ is computed between the predicted activity z-score and the original expression of the target gene across all donor-level samples within that atlas. A positive ρ means higher predicted activity tracks with higher target gene expression.
GTEx uses per-tissue pseudobulk (median-of-medians across 29 tissues); TCGA uses per-cancer type (median-of-medians across 33 cancers); CIMA/Inflammation Atlas Main use donor-only; scAtlas Normal uses donor-only; scAtlas Cancer uses tumor-only.
4.10 Per-Target Correlation Rankings
5. CytoSig vs LinCytoSig vs SecAct Comparison
5.1 Method Overview
| Method | Targets | Genes | Specificity | Selection |
|---|---|---|---|---|
| CytoSig | 43 cytokines | 4,881 curated | Global (all cell types) | — |
| LinCytoSig (orig) | 178 (45 CT × cytokines) | All ~20K | Cell-type specific | Matched cell type |
| LinCytoSig (gene-filtered) | 178 | 4,881 (CytoSig overlap) | Cell-type specific | Matched cell type |
| LinCytoSig Best (combined) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max combined GTEx+TCGA ρ |
| LinCytoSig Best (comb+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max combined ρ (filtered) |
| LinCytoSig Best (GTEx) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max GTEx ρ |
| LinCytoSig Best (TCGA) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max TCGA ρ |
| LinCytoSig Best (GTEx+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max GTEx ρ (filtered) |
| LinCytoSig Best (TCGA+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max TCGA ρ (filtered) |
| SecAct | 1,170 secreted proteins | Spatial Moran’s I | Global (all cell types) | — |
Gene filter: LinCytoSig signatures restricted from ~20K to CytoSig’s 4,881 curated genes. Best selection: For each cytokine, test all cell-type-specific LinCytoSig signatures and select the one with the highest bulk RNA-seq correlation. “Combined” uses pooled GTEx+TCGA; “GTEx” and “TCGA” select independently per bulk dataset. “+filt” variants apply the same cell-type selection but restrict to CytoSig gene space. See LinCytoSig Methodology for details.
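The “Best” selection and gene-filter steps can be sketched with plain dictionaries. The function names and all ρ values and weights below are hypothetical and serve only to show the selection logic.

```python
def select_best_signatures(bulk_rho):
    """bulk_rho: {(cytokine, cell_type): rho} -> {cytokine: (cell_type, rho)}."""
    best = {}
    for (cytokine, cell_type), rho in bulk_rho.items():
        # Keep the cell-type-specific signature with the highest bulk rho.
        if cytokine not in best or rho > best[cytokine][1]:
            best[cytokine] = (cell_type, rho)
    return best

def filter_genes(signature, allowed_genes):
    """Restrict a {gene: weight} signature to an allowed gene set."""
    allowed = set(allowed_genes)
    return {g: w for g, w in signature.items() if g in allowed}

toy_rho = {
    ("IFNG", "NK"): 0.41, ("IFNG", "CD8_T"): 0.35,
    ("IL1B", "Mono"): 0.52, ("IL1B", "DC"): 0.60,
}
best = select_best_signatures(toy_rho)
filtered = filter_genes({"STAT1": 1.2, "IRF1": 0.8, "XYZ": -0.3}, ["STAT1", "IRF1"])
```

Because the selection uses bulk correlations, any later evaluation on the same bulk dataset is no longer independent of the selection step, which is worth keeping in mind when comparing the “Best” variants.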
Ten methods compared on identical matched pairs across 4 combined datasets:
- CytoSig — 43 cytokines, 4,881 curated genes, global (all cell types)
- LinCytoSig (orig) — cell-type-matched signatures, all ~20K genes
- LinCytoSig (gene-filtered) — cell-type-matched signatures, restricted to CytoSig’s 4,881 genes
- LinCytoSig Best (combined) — best cell-type signature per cytokine (selected by combined GTEx+TCGA bulk ρ), all ~20K genes
- LinCytoSig Best (comb+filt) — best combined bulk signature, restricted to 4,881 genes
- LinCytoSig Best (GTEx) — best per cytokine selected by GTEx-only bulk ρ, all ~20K genes
- LinCytoSig Best (TCGA) — best per cytokine selected by TCGA-only bulk ρ, all ~20K genes
- LinCytoSig Best (GTEx+filt) — GTEx-selected best, restricted to 4,881 genes
- LinCytoSig Best (TCGA+filt) — TCGA-selected best, restricted to 4,881 genes
- SecAct — 1,170 secreted proteins (Moran’s I), subset matching CytoSig targets
Key findings:
- SecAct achieves the highest median ρ across all 4 combined datasets, benefiting from spatial-transcriptomics-derived signatures.
- CytoSig outperforms most LinCytoSig variants at donor level, with one notable exception: scAtlas Normal Best-orig (0.298) exceeds CytoSig (0.216).
- Gene filtering improves LinCytoSig in most datasets (CIMA +102%, Inflammation Atlas Main), confirming noise reduction from restricting the gene space.
- GTEx-selected best performs comparably to combined-selected in most datasets but slightly better in scAtlas Cancer (0.300 vs 0.275). TCGA-selected best generally underperforms other selection strategies, suggesting GTEx’s broader tissue coverage provides more generalizable selections.
- Gene filtering of GTEx/TCGA-selected: GTEx+filt and TCGA+filt show mixed results — filtering sometimes improves (e.g., TCGA+filt in Inflammation Atlas Main: 0.260 vs TCGA-orig 0.168) but can also reduce performance, indicating the optimal gene space depends on both the selection dataset and target dataset context.
- General ranking: SecAct > CytoSig > LinCytoSig Best variants > LinCytoSig (filt) > LinCytoSig (orig), though dataset-specific exceptions exist.
5.2 Effect of Aggregation Level
Methodology: At each cell-type aggregation level (CIMA: L1–L4 = 7–73 cell types; Inflammation: L1–L2; scAtlas: CT1–CT2 = coarse/fine), we match CytoSig, LinCytoSig, and SecAct on identical (cytokine, cell type) pairs — using the exact same pseudobulk samples and identical n for all three methods. For each pair, Spearman ρ measures agreement between predicted activity and target gene expression. If lineage-specific aggregation helps, LinCytoSig should increasingly outperform CytoSig as cell-type resolution increases (L1 → L4).
5.2.1 Distribution at Each Level
5.2.2 Summary
n = number of three-way matched pairs. Δρ = LinCytoSig − competitor (negative = LinCytoSig underperforms).
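A small sketch of how n and Δρ could be derived from three per-method result tables keyed by (cytokine, cell type). The function name and the ρ values are hypothetical.

```python
def matched_delta(rho_a, rho_b, rho_c):
    """Mean rho differences (a - b, a - c) over keys present in all three methods."""
    shared = sorted(set(rho_a) & set(rho_b) & set(rho_c))
    n = len(shared)
    d_ab = sum(rho_a[k] - rho_b[k] for k in shared) / n
    d_ac = sum(rho_a[k] - rho_c[k] for k in shared) / n
    return n, d_ab, d_ac

lincytosig = {("IFNG", "NK"): 0.40, ("IL1B", "Mono"): 0.30, ("IL6", "B"): 0.10}
cytosig = {("IFNG", "NK"): 0.30, ("IL1B", "Mono"): 0.50}
secact = {("IFNG", "NK"): 0.50, ("IL1B", "Mono"): 0.60, ("TNFA", "Mono"): 0.20}
n, d_vs_cytosig, d_vs_secact = matched_delta(lincytosig, cytosig, secact)
```

Only pairs present in all three tables contribute, so every method is scored on exactly the same pseudobulk samples.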
5.2.3 Which Cell Types Benefit?
Aggregated across all datasets at finest celltype level. Green = LinCytoSig wins more; red = LinCytoSig loses more.
5.2.4 Which Cytokines Benefit?
Sorted by mean Δρ vs CytoSig (best to worst).
Key finding: Lineage-specific aggregation provides no systematic advantage at any level.
- At every level, LinCytoSig underperforms CytoSig (mean Δρ ranges from −0.08 at coarse L1 to −0.02 at fine L4 in CIMA). Finer cell types reduce the gap slightly but never close it.
- SecAct wins at every level in CIMA and scAtlas. In Inflammation Atlas Main L2, LinCytoSig is nearly tied with SecAct (Δρ = +0.01) but still loses to CytoSig.
- Per cell type: Only 5 of 43 cell types show consistent LinCytoSig advantage vs CytoSig (NK Cell, Basophil, DC, Trophoblast, Arterial Endothelial). No cell type beats SecAct.
- Interpretation: CytoSig’s global signature, derived from median log2FC across all cell types, already captures the dominant transcriptional response. Restricting to a single cell type’s response introduces noise from small sample sizes without gaining meaningful lineage specificity. The hypothesis that finer resolution should favor LinCytoSig is not supported by the data.
5.3 SecAct: Breadth Over Depth
- Highest median ρ in single-cell datasets (scAtlas Normal: 0.455, Cancer: 0.399, independence-corrected)
- Highest median ρ in bulk RNA-seq (GTEx: 0.314, TCGA: 0.357, independence-corrected median-of-medians)
- 95.8% positive correlation in TCGA (independence-corrected)
- Wins decisively at celltype level against both CytoSig and LinCytoSig in scAtlas (19/3 wins vs CytoSig in scAtlas Normal, 20/2 in Cancer)
6. Key Takeaways for Scientific Discovery
6.1 What CytoAtlas Enables
- Quantitative cytokine activity per cell type per disease — 43 CytoSig cytokines + 1,170 SecAct secreted proteins across 29M cells
- Cross-disease comparison — same signatures validated across 20 diseases, 35 organs, 15 cancer types
- Independent perturbation comparison — parse_10M provides 90 cytokine perturbations × 12 donors × 18 cell types for independent comparison with CytoSig predictions
- Multi-level validation — donor, donor × celltype, bulk RNA-seq (GTEx/TCGA), and resampled bootstrap validation across 6 datasets
6.2 Limitations
- Linear model: Cannot capture non-linear cytokine interactions
- Transcriptomics-only: Post-translational regulation invisible
- Signature matrix bias: Underrepresented cell types have weaker signatures
- Validation metric: Expression-activity correlation underestimates true accuracy (signatures acting through downstream effectors are not captured)
6.3 Future Directions
- scGPT cohort integration (~35M cells)
- cellxgene Census integration
- Classification of cytokine blocking therapy
7. Appendix: Technical Specifications
A. Computational Infrastructure
- GPU: NVIDIA A100 80GB (SLURM gpu partition)
- Memory: 256–512GB host RAM per node
- Pipeline: 24 Python scripts, 18 pipeline subpackages (~18.7K lines)
- API: 262 REST endpoints across 17 routers
- Frontend: 12 pages, 122 source files, 11.4K LOC
B. Statistical Methods
- Activity inference: Ridge regression (λ = 5×10⁵, z-score normalization, permutation-based significance)
- Correlation: Spearman rank correlation
- Multiple testing: Benjamini-Hochberg FDR (q < 0.05)
- Bootstrap: 100–1000 resampling iterations
- Differential: Wilcoxon rank-sum test with effect size
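As an illustration of the bootstrap item, the sketch below computes a percentile confidence interval for a Spearman correlation by resampling donors with replacement. It is one common way to build such an interval and is not necessarily the exact procedure used in the pipeline; the helper names and toy data are hypothetical.

```python
import random

def _ranks(v):
    """Average ranks (1..n), handling ties created by resampling."""
    order = sorted(range(len(v)), key=v.__getitem__)
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def bootstrap_ci(x, y, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy data: 30 donors with a noisy monotone relationship.
x = list(range(30))
y = [i + 3 * ((i * 7) % 5 - 2) for i in x]
ci_low, ci_high = bootstrap_ci(x, y)
```

More resamples give smoother interval endpoints at proportionally higher cost.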
C. Detailed System Architecture
This section provides an in-depth technical description of each layer in the CytoAtlas platform, covering the two primary software packages (cytoatlas-pipeline and cytoatlas-api) and how they work together to serve cytokine activity data to end users.
C.1 How CytoAtlas Serves Users
CytoAtlas is a full-stack bioinformatics platform that transforms raw single-cell RNA-seq data into actionable cytokine and secreted protein activity scores, then makes those results explorable through a web application and an AI-powered chat assistant. The system operates in two phases:
- Offline phase (cytoatlas-pipeline): GPU-accelerated batch jobs on SLURM/A100 infrastructure process ~29M single cells and ~31K bulk RNA-seq samples through ridge regression against 3 signature matrices, producing activity z-scores, cross-sample correlations, and validation metrics. Results are stored in DuckDB columnar databases.
- Online phase (cytoatlas-api): A FastAPI server reads pre-computed results from DuckDB and serves them to a React single-page application. Users can interactively explore activity patterns, compare across datasets, search genes, export data, and ask natural-language questions to an AI chat assistant backed by dual LLMs and a RAG knowledge base.
C.2 Offline Pipeline (cytoatlas-pipeline)
The pipeline package (cytoatlas-pipeline, 91 Python modules organized into 14 subpackages) orchestrates the entire compute-intensive data processing workflow.
Pipeline stages (25 numbered scripts executed via SLURM):
| Stage | Scripts | Operation | Output |
|---|---|---|---|
| 1. Activity inference | 01–05 | Ridge regression on each dataset (CIMA, Inflammation, scAtlas Normal/Cancer, parse_10M) against CytoSig (43), LinCytoSig (178), and SecAct (1,170) signatures | Activity H5AD files (signatures × samples) |
| 2. Pseudobulk aggregation | 11–12 | Multi-level aggregation: donor-only, donor×celltype (L1–L4). Single-pass streaming through H5AD with min_cells=10 filter | Pseudobulk H5AD per (atlas, level) |
| 3. Cross-sample correlation | 12–13 | Spearman ρ between predicted activity and target gene expression for each (atlas, level, signature). Alias resolution for CytoSig target names (e.g., TNFA → TNF) | Correlation CSV files |
| 4. Bulk validation | 14–15 | Ridge regression on GTEx (19.8K TPM samples) and TCGA (11.1K RSEM samples) with format-specific preprocessing and gene ID mapping (versioned ENSG for GTEx, symbol\|entrezID for TCGA) | Bulk activity H5AD + validation JSON |
| 5. Bootstrap validation | 16 | 100 bootstrap resamples of pseudobulk for confidence intervals on correlation estimates | Resampled activity with 95% CI |
| 6. Visualization prep | 06, 17, 24–25 | Flatten nested results into JSON arrays for web visualization; generate scatter plot point clouds | Flat JSON files |
| 7. Database generation | convert_*_to_duckdb.py | Import JSON/CSV into DuckDB columnar tables with indexes; 10× compression vs raw JSON | 3 DuckDB databases |
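The pseudobulk stage in the table above (single-pass aggregation with a min_cells=10 filter) can be sketched as follows. This is a minimal illustration, assuming cells arrive in chunks of (expression, donor labels, cell-type labels); the function name `pseudobulk` and the chunk layout are hypothetical, and the real pipeline streams from H5AD files instead.

```python
import numpy as np
from collections import defaultdict

MIN_CELLS = 10  # groups with fewer cells are dropped, as in the stage table

def pseudobulk(chunks, min_cells=MIN_CELLS):
    """Mean expression per (donor, cell type), computed in one pass.

    `chunks` yields (expr, donors, celltypes), where expr is cells x genes
    and donors/celltypes are per-cell labels.
    """
    sums = {}
    counts = defaultdict(int)
    for expr, donors, celltypes in chunks:
        for row, donor, celltype in zip(expr, donors, celltypes):
            key = (donor, celltype)
            if key in sums:
                sums[key] += row
            else:
                sums[key] = row.astype(float)  # astype returns a fresh copy
            counts[key] += 1
    # Apply the minimum-cell filter only after every chunk has been seen.
    return {k: sums[k] / counts[k] for k in sums if counts[k] >= min_cells}

rng = np.random.default_rng(0)

def toy_chunks():
    # Two chunks of 12 cells and 3 genes; (D2, B) ends with only 8 cells.
    for _ in range(2):
        expr = rng.poisson(2.0, size=(12, 3))
        yield expr, ["D1"] * 8 + ["D2"] * 4, ["T"] * 8 + ["B"] * 4

profiles = pseudobulk(toy_chunks())
```

Keeping only running sums and counts means no chunk has to stay in memory once it has been processed. The per-cell Python loop is written for clarity, not speed.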
SecActpy ridge regression is the core mathematical operation. For each dataset, the expression matrix Y (genes × samples) is regressed against a signature matrix X (genes × signatures) using L2-regularized linear regression:
$$\beta = (X^\top X + \lambda I)^{-1} X^\top Y, \qquad \lambda = 5 \times 10^{5}$$
Significance is assessed via 1,000 permutation tests (random column shuffles of Y), yielding z-scores and p-values for each signature×sample pair. The function returns beta, se, zscore, and pvalue matrices. For datasets exceeding 1,000 samples, ridge_batch() streams the computation in chunks of 5,000–10,000 samples while keeping the precomputed projection matrix $T = (X^\top X + \lambda I)^{-1} X^\top$ on the GPU, achieving 10–34× speedup over NumPy via CuPy’s cuBLAS backend.
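The ridge fit, the reusable projection matrix, and the chunked permutation null described above can be sketched in plain NumPy. This is not the SecActpy implementation: the function name `ridge_zscores` is hypothetical, the null is built here by permuting the gene axis (an assumption of this sketch), and only betas and z-scores are returned, not standard errors or p-values.

```python
import numpy as np

def ridge_zscores(X, Y, lam=5e5, n_perm=1000, chunk=5000, seed=0):
    """Ridge betas and permutation z-scores.

    X: genes x signatures, Y: genes x samples.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_sig = X.shape
    # Projection matrix T = (X^T X + lam*I)^-1 X^T, computed once and reused.
    T = np.linalg.solve(X.T @ X + lam * np.eye(n_sig), X.T)
    # One set of gene permutations shared by every sample chunk.
    perms = [rng.permutation(n_genes) for _ in range(n_perm)]
    beta = np.empty((n_sig, Y.shape[1]))
    zscore = np.empty_like(beta)
    for start in range(0, Y.shape[1], chunk):
        Yc = Y[:, start:start + chunk]
        b = T @ Yc
        total = np.zeros_like(b)
        total_sq = np.zeros_like(b)
        for perm in perms:
            # Permuting the columns of T is equivalent to shuffling gene rows of Yc.
            b_null = T[:, perm] @ Yc
            total += b_null
            total_sq += b_null ** 2
        mean = total / n_perm
        sd = np.sqrt(np.maximum(total_sq / n_perm - mean ** 2, 1e-24))
        beta[:, start:start + chunk] = b
        zscore[:, start:start + chunk] = (b - mean) / sd
    return beta, zscore

# Toy data: 200 genes, 3 signatures, 5 samples; sample 0 responds to signature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(200, 5))
Y[:, 0] += 3.0 * X[:, 0]
beta, zscore = ridge_zscores(X, Y, n_perm=200, chunk=2)
```

Because T depends only on the signature matrix, each chunk costs one matrix product per permutation. That is why the projection matrix can stay resident on the GPU while the samples stream past it.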
SLURM resource profiles:
| Profile | GPU | RAM | CPUs | Time Limit | Use Case |
|---|---|---|---|---|---|
| gpu_heavy | A100 | 128 GB | 16 | 24–48h | Atlas activity inference |
| gpu_medium | A100 | 128 GB | 8 | 4–8h | Validation, bulk processing |
| cpu_heavy | — | 128 GB | 16 | 6h | Preprocessing, aggregation |
| cpu_normal | — | 64 GB | 8 | 2–4h | Flattening, DuckDB conversion |
GPU nodes require module load CUDA/12.8.1 cuDNN and the secactpy conda environment for libcublas.so.12 and CuPy availability.
Atlas registry (cytoatlas-pipeline/batch/atlas_config.py) defines each dataset’s H5AD path, annotation levels, sample column, and cell count. Six atlases are registered: CIMA (6.5M cells, 4 annotation levels L1–L4), Inflammation Main/Val/Ext (4M/1.5M/0.8M cells, 2 levels), and scAtlas Normal and Cancer (3M/3.4M cells, organ×celltype levels).
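A registry of this kind could be laid out roughly as below. This is a sketch only: the class name, field names, registry keys, file paths, sample-column names, and level labels are all hypothetical placeholders. Only the six atlases and their approximate cell counts come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtlasConfig:
    name: str
    h5ad_path: str             # placeholder paths, not the real locations
    annotation_levels: tuple   # level labels are illustrative
    sample_column: str         # hypothetical column names
    n_cells: int               # approximate counts from the text

_CONFIGS = [
    AtlasConfig("cima", "<data>/cima.h5ad", ("L1", "L2", "L3", "L4"), "donor", 6_500_000),
    AtlasConfig("inflammation_main", "<data>/inflam_main.h5ad", ("L1", "L2"), "sample", 4_000_000),
    AtlasConfig("inflammation_val", "<data>/inflam_val.h5ad", ("L1", "L2"), "sample", 1_500_000),
    AtlasConfig("inflammation_ext", "<data>/inflam_ext.h5ad", ("L1", "L2"), "sample", 800_000),
    AtlasConfig("scatlas_normal", "<data>/scatlas_normal.h5ad", ("organ", "celltype"), "sample", 3_000_000),
    AtlasConfig("scatlas_cancer", "<data>/scatlas_cancer.h5ad", ("organ", "celltype"), "sample", 3_400_000),
]

ATLASES = {cfg.name: cfg for cfg in _CONFIGS}

def get_atlas(name):
    """Look up an atlas by name, failing loudly on unknown names."""
    try:
        return ATLASES[name]
    except KeyError:
        raise KeyError(f"unknown atlas {name!r}; known: {sorted(ATLASES)}") from None

total_cells = sum(cfg.n_cells for cfg in ATLASES.values())
```

Keeping the registry as frozen dataclasses means a batch script cannot accidentally mutate a shared configuration while iterating over atlases.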
C.3 Client Layer: React SPA
The frontend is a React 19.2.0 single-page application built with TypeScript 5.9, Vite 7.3, and Tailwind CSS 4.1. Users interact with CytoAtlas exclusively through this browser-based interface.
Pages (15 routes via React Router 7.13):
| Page | Function | Key Visualizations |
|---|---|---|
| Home | Landing page with atlas overview cards | Atlas summary statistics |
| Explore | Browse available datasets and signatures | Filterable data tables |
| Atlas Detail | Deep-dive into a single atlas (CIMA, Inflammation, scAtlas) with tabbed panels | Activity heatmaps, disease comparisons, cell-type profiles |
| Search | Global search across genes, cytokines, proteins, cell types, diseases, organs | Ranked result cards |
| Gene Detail | Gene-centric view: CytoSig activity, SecAct activity, correlations, diseases, expression | Multi-tab gene explorer |
| Compare | Cross-atlas comparison: conserved signatures, cell-type mapping, meta-analysis | Side-by-side boxplots, Sankey diagrams |
| Validate | Data quality metrics and validation results | Correlation scatter, quality grades |
| Perturbation | Cytokine stimulation experiments (parse_10M, Tahoe) | Dose-response curves, drug sensitivity |
| Spatial | Spatial transcriptomics data | 2D/3D tissue maps, gene coverage |
| Chat | AI assistant for natural-language data queries | Inline Plotly/D3 charts generated by LLM tool calls |
| Submit | User dataset submission workflow | Upload form, progress tracking |
State management: Zustand 5.0 stores global UI state (current atlas, selected signature type, sidebar/filter state, user session). TanStack React Query 5.90 manages server state with automatic caching, background refetching, and optimistic updates. API calls go through a typed client (src/api/client.ts) using the Fetch API with JWT Bearer or API Key authentication.
Visualization: 12 chart component types built on Plotly.js 3.3 and D3.js 7.9, including bar, scatter, heatmap, boxplot, violin, forest plot, volcano, Sankey, lollipop, and line charts. Each atlas has 5–6 specialized panels (e.g., CIMA: biochemistry, eQTL, metabolites, population; Inflammation: disease, treatment, severity, drivers; scAtlas: immune infiltration, exhaustion, CAF signatures).
C.4 Gateway Layer: Nginx Reverse Proxy
In production, Nginx sits between the browser and FastAPI, providing rate limiting, connection limiting, SSL termination, and static file caching. It is deployed as an optional Docker Compose service (production profile).
| Feature | Configuration |
|---|---|
| Rate limiting | 10 requests/second per IP, burst of 20 (nodelay) |
| Connection limiting | 10 concurrent connections per IP |
| Gzip compression | Level 6, min 1,000 bytes; JSON, CSS, JS types |
| Upstream keepalive | 32 persistent connections to FastAPI (port 8000) |
| Timeouts | 60s connect/send/read |
| Static file cache | 30-day expiry, immutable headers for /static/ |
| Health checks | /api/v1/health excluded from rate limiting |
| SSL/TLS | TLS 1.2+ template (production) |
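To make the rate-limiting row concrete, the toy model below simulates a leaky bucket with a burst allowance and no queuing delay, in the spirit of the 10 requests/second, burst 20 (nodelay) setting. It is a simplified Python model for intuition only, not Nginx's code, and the class name `LeakyBucket` is hypothetical.

```python
class LeakyBucket:
    """Per-client limiter: `rate` requests/second with `burst` extra slots."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.excess = 0.0   # requests currently above the steady rate
        self.last = None    # timestamp of the last accepted request

    def allow(self, now):
        if self.last is None:           # first request from this client
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then add this request.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False                # rejected; state is left unchanged
        self.excess = excess
        self.last = now
        return True

bucket = LeakyBucket(rate=10, burst=20)
# 30 requests arriving at the same instant: the first one plus 20 burst slots pass.
accepted = sum(bucket.allow(0.0) for _ in range(30))
```

In this model the 21 accepted requests are served immediately and the other 9 are refused outright, which matches the intent of a no-delay burst: short spikes are absorbed, sustained floods are not.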
C.5 Application Layer: FastAPI
The FastAPI application (cytoatlas-api/app/main.py) is the central routing and orchestration layer. It processes all API requests, enforces authentication, and dispatches to the appropriate service.
Middleware stack (7 layers, applied in reverse order):
- GZipMiddleware — compresses responses > 1,000 bytes
- SecurityHeadersMiddleware (53 lines) — sets CSP (allows Plotly CDN, D3), X-Content-Type-Options, X-Frame-Options, HSTS (production), Permissions-Policy
- CORSMiddleware — configurable allowed origins
- AuditMiddleware (220 lines) — logs all API calls to rotating JSONL files (100 MB max, 5 backups) with token redaction
- RequestLoggingMiddleware (164 lines) — HTTP request timing and error tracking
- MetricsMiddleware — Prometheus metrics collection
- Cache headers — adds X-API-Version: 1.0
Authentication: JWT tokens (HS256, 30-min expiry) stored in HttpOnly cookies, plus API key support (PBKDF2-SHA256 hashed, prefix-indexed for O(1) lookup via X-API-Key header). Password hashing uses bcrypt. Role-based access control (RBAC) with 4 roles: admin, analyst, viewer, contributor.
Routers (17 modules, ~5,900 lines, ~260 endpoints):
| Router | Endpoints | Purpose |
|---|---|---|
| Health | 3 | Liveness, readiness, status probes |
| Auth | 4 | Login, register, verify, logout |
| Atlases | 8 | Atlas registry, summaries, metadata |
| CIMA | ~32 | CIMA-specific: activity, eQTL, biochemistry, metabolites, population |
| Inflammation | ~44 | Disease activity, treatment response, severity, cell drivers |
| scAtlas | ~36 | Organ signatures, cancer comparison, immune infiltration, exhaustion |
| Cross-Atlas | ~28 | Conserved signatures, cell-type mapping, meta-analysis |
| Validation | ~30 | 5-type credibility assessment, quality grades |
| Gene | 8 | Gene-centric views and associations |
| Search | 4 | Global entity search (gene, cytokine, cell type, disease, organ) |
| Export | 6 | CSV/JSON data export per atlas |
| Chat | 4 | AI conversation, suggestions, history, streaming |
| Perturbation | 12+ | parse_10M cytokine stimulation, Tahoe drug sensitivity |
| Spatial | 12+ | Tissue maps, gene coverage, technology comparison |
| Submit | 4 | User dataset submission workflow |
| WebSocket | 2 | Real-time streaming connections |
| Pipeline | 4 | Pipeline status and management |
Lifespan management: On startup, the application initializes the database (if configured), connects to Redis or the in-memory cache, starts the audit logger, and generates a runtime secret in development mode. On shutdown, it gracefully disconnects all resources.
C.6 Data Query Service
All science data queries go through the DuckDB repository (app/repositories/duckdb_repository.py), which provides a safe, async interface to the 3 pre-computed columnar databases.
Three DuckDB databases:
| Database | Content | Key Tables |
|---|---|---|
| atlas_data.duckdb | Activity scores, correlations, validation, biological associations | activity, cross_sample_correlations, bulk_rnaseq_validation, celltype_specific_activity, age_bmi_data, cima_correlations, inflammation_disease, scatlas_cancer_comparison |
| perturbation_data.duckdb | Cytokine stimulation and drug experiments | parse10m_activity, parse10m_treatment_effect, parse10m_ground_truth, tahoe_drug_effect, tahoe_dose_response, tahoe_drug_sensitivity |
| spatial_data.duckdb | Spatial transcriptomics data | spatial_activity, spatial_coordinates, spatial_neighborhood, spatial_gene_coverage, spatial_technology_comparison |
Query safety: A safelist of 115 known tables is enforced before any query execution — queries referencing unlisted tables are rejected. All user input is passed via parameterized queries (never string interpolation), preventing SQL injection. For large result sets, batch-based streaming returns 2,000 rows per chunk. All DuckDB calls are wrapped in asyncio.run_in_executor() to avoid blocking the FastAPI event loop, and databases are opened in read-only mode.
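The combination of a table safelist, bound parameters, chunked fetching, and executor offloading can be sketched as below. To stay self-contained, the sketch uses Python's built-in sqlite3 module as a stand-in for DuckDB. The two-entry safelist, the column names `cell_type` and `zscore`, and the function names are hypothetical.

```python
import asyncio
import sqlite3

ALLOWED_TABLES = {"activity", "cross_sample_correlations"}  # illustrative subset
CHUNK_ROWS = 2000                                           # rows per streamed chunk

def _query_chunks(conn, table, atlas, signature):
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table not in safelist: {table!r}")
    # The table name is interpolated only after passing the safelist;
    # user-supplied values are always bound as parameters.
    cur = conn.execute(
        f"SELECT cell_type, zscore FROM {table} "
        "WHERE atlas = ? AND signature = ? ORDER BY cell_type",
        (atlas, signature),
    )
    chunks = []
    while True:
        rows = cur.fetchmany(CHUNK_ROWS)
        if not rows:
            break
        chunks.append(rows)
    return chunks

async def query_activity(conn, table, atlas, signature):
    """Run the blocking query in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _query_chunks, conn, table, atlas, signature)

# Toy database; check_same_thread=False lets the executor thread use the connection.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE activity (atlas TEXT, signature TEXT, cell_type TEXT, zscore REAL)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?, ?)",
    [("cima", "IFNG", "NK", 1.7), ("cima", "IFNG", "CD8 T", 2.1), ("cima", "TNF", "NK", 0.3)],
)
chunks = asyncio.run(query_activity(conn, "activity", "cima", "IFNG"))
```

A table name cannot be passed as a bound parameter, which is why the safelist check has to happen before the SQL string is built.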
C.7 AI Chat Service
The chat service allows users to ask natural-language questions about CytoAtlas data. It is implemented as a modular pipeline: ChatService (orchestrator) → LLMClient (language model) + RAGService (knowledge retrieval) + ToolExecutor (22 data tools).
Dual LLM client: The primary model is Mistral-Small-24B (Mistral-Small-3.1-24B-Instruct-2503) served via vLLM with an OpenAI-compatible endpoint. If the vLLM server is unreachable (connection timeout or error), the client automatically falls back to Claude Sonnet (claude-sonnet-4-5-20250929) via the Anthropic API. Both models receive the same system prompt (~150 lines) defining atlas descriptions, signature types, a 3-part response pattern (analysis plan → tool execution → results interpretation), and visualization rules per data type. A JSON repair strategy handles Mistral’s occasionally malformed streaming output by scoring candidate JSON substrings against expected tool parameter keys.
RAG (Retrieval-Augmented Generation): A LanceDB vector database stores embeddings of platform documentation, column definitions, atlas summaries, and biological context (~100 indexed documents). Queries are embedded using the all-MiniLM-L6-v2 sentence transformer (384-dim, ~22.7M parameters, CPU-friendly) and the top-5 most similar documents are retrieved via cosine similarity to augment the LLM’s context.
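The retrieval step reduces to a cosine-similarity ranking over document embeddings. The sketch below shows that ranking in NumPy, with random 384-dimensional vectors standing in for real sentence-transformer embeddings. It does not use LanceDB, and the function name `top_k` is hypothetical.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, cosine similarity) for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                        # cosine similarity per document
    order = np.argsort(-sims)[:k]       # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))                # ~100 documents, 384-dim embeddings
query = docs[42] + 0.1 * rng.normal(size=384)     # a query close to document 42
hits = top_k(query, docs, k=5)
```

In the real service the top-ranked documents would be added to the LLM prompt as context. Here only their indices and scores are returned.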
22 data tools available to the LLM for answering questions:
| Category | Tools | Examples |
|---|---|---|
| Data Query (11) | search_entity, get_atlas_summary, list_cell_types, list_signatures, get_activity_data, get_correlations, get_disease_activity, compare_atlases, get_validation_metrics, export_data, create_visualization | “Show me TNF activity in macrophages across all atlases” |
| Documentation (6) | get_data_lineage, get_column_definition, find_source_script, list_panel_outputs, get_dataset_info, get_methodology | “How was the CIMA pseudobulk generated?” |
| Advanced (5) | advanced_query, create_comparison_report, suggest_analysis, generate_reproducible_code, submit_feedback | “Generate Python code to reproduce the IL1B correlation analysis” |
Tools execute asynchronously, with results truncated to 4,000 characters if needed. The ToolExecutor validates tool names, normalizes arguments (handling alias mappings), and formats structured responses. A ChatInputSanitizer checks for malicious patterns (SQL injection, path traversal) and enforces a 10,000-character input limit.
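The executor behavior described above (name validation, alias normalization, asynchronous execution, and truncation to 4,000 characters) might look roughly like this. The alias table, the stub tool body, and the error format are hypothetical. Only the tool name get_activity_data and the 4,000-character limit come from the text.

```python
import asyncio
import json

MAX_RESULT_CHARS = 4000
ARG_ALIASES = {"cytokine": "signature", "celltype": "cell_type"}  # hypothetical aliases

async def get_activity_data(atlas, signature, cell_type=None):
    # Stub standing in for a real database query.
    return {"atlas": atlas, "signature": signature,
            "cell_type": cell_type, "zscores": [0.0] * 2000}

TOOLS = {"get_activity_data": get_activity_data}

async def execute_tool(name, arguments):
    """Validate the tool name, normalize arguments, run it, and truncate the result."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = {ARG_ALIASES.get(k, k): v for k, v in arguments.items()}
    result = json.dumps(await TOOLS[name](**args))
    if len(result) > MAX_RESULT_CHARS:
        result = result[:MAX_RESULT_CHARS]   # keeps the LLM context bounded
    return result

output = asyncio.run(execute_tool("get_activity_data", {"atlas": "cima", "cytokine": "IFNG"}))
```

Plain truncation can cut a JSON document mid-value, so a production version would more likely summarize or paginate large results. The sketch keeps the simple cut to mirror the description.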
Conversation persistence: Chat history (conversations, messages, downloadable data) is stored in SQLite (default) or PostgreSQL (production) via SQLAlchemy async sessions.
C.8 Storage Layer
CytoAtlas uses a 3-tier storage architecture:
Tier 1 — Science data (DuckDB): Three separate .duckdb files store all pre-computed analytical data in columnar format. DuckDB was chosen because it requires no server process — essential on HPC where database servers are unavailable. Each database regenerates independently without affecting the others. Queries run as read-only in-process SQL, with parameterized inputs and safelist validation.
Tier 2 — Application state (SQLite/PostgreSQL): 10 ORM tables (659 lines, SQLAlchemy models) store users (email, password hash, API keys, roles), conversations and messages (chat history), jobs (background task tracking), computed statistics, validation metrics, and dataset/signature/cell-type metadata. SQLite with WAL mode is the default for HPC single-node deployment; PostgreSQL (via asyncpg, 5–10 connection pool) is used in Docker production.
Tier 3 — Cache (Redis/In-Memory): A CacheService (251 lines) with 4-tier strategy: L1 in-process Python dict (instant, ephemeral) → L2 Redis (fast, persistent) → L3 disk JSON (large results) → L4 DuckDB (source of truth). TTLs: summary stats 24h, heatmaps 1h, filtered results 5min, session data 30min. Redis runs with AOF persistence; in development, an in-memory dict with LRU eviction is used as a drop-in replacement.
C.9 Deployment
The platform supports two deployment modes:
Docker Compose (production): 5 services — FastAPI (Uvicorn on port 8000), PostgreSQL 16 (with healthcheck), Redis 7 (AOF persistence), Nginx (optional, production profile), and a Celery worker (optional, workers profile). The Dockerfile uses a 3-stage build: Node 22 for the frontend (npm run build → /app/static/), Python 3.11 with uv for the backend, and a minimal production image running as non-root appuser with a curl /api/v1/health/live healthcheck.
SLURM HPC (current): The API server runs as a 7-day SLURM job with SQLite and in-memory cache. Data volumes (/data/Jiang_Lab/, /data/parks34/projects/2cytoatlas/) are mounted read-only. GPU pipeline scripts run as separate SLURM jobs with dependency tracking via submit_jobs.py.
C.10 Request Lifecycle Example
When a user searches for “IFNG activity in CIMA”:
- Browser: React’s TanStack Query hook calls GET /api/v1/cima/activity?signature=IFNG&sig_type=cytosig
- Nginx: Checks rate limit (10 req/s), passes to upstream FastAPI
- FastAPI middleware: Logs request, checks JWT token, adds security headers, starts metrics timer
- CIMA router: Validates Pydantic request model, calls CIMAService
- CIMAService: Checks L1/L2 cache for this query
- DuckDB repository: If cache miss, runs parameterized SQL: SELECT * FROM activity WHERE atlas='cima' AND signature='IFNG' via run_in_executor()
- Response: JSON with activity z-scores per cell type → cached → returned through middleware → gzip compressed → browser renders Plotly chart
For AI chat queries (e.g., “Which cell types have the highest IFNG activity in CIMA?”), the request goes through the chat router → ChatService, which embeds the query for RAG retrieval, builds the LLM context with relevant documentation, calls Mistral-24B (or Claude fallback), executes any tool calls (e.g., get_activity_data), and streams the response with inline visualizations back to the browser.